Statistical mechanics of sparse generalization and model selection 
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One of the crucial tasks in many inference problems is the extraction of sparse information out of 
a given number of high-dimensional measurements. In machine learning, this is frequently achieved 
using, as a penality term, the L v norm of the model parameters, with p < 1 for efficient dilution. Here 
we propose a statistical-mechanics analysis of the problem in the setting of perceptron memorization 
and generalization. Using a replica approach, we are able to evaluate the relative performance of 
naive dilution (obtained by learning without dilution, following by applying a threshold to the model 
parameters), Li dilution (which is frequently used in convex optimization) and Lo dilution (which 
is optimal but computationally hard to implement). Whereas both L p diluted approaches clearly 
outperform the naive approach, we find a small region where Lo works almost perfectly and strongly 
outperforms the simpler to implement L\ dilution. 



PACS numbers: 02.50.Tt Inference methods, 05.20.-y Classical statistical mechanics 



I. INTRODUCTION 



The problem of extracting sparse information from high-dimensional data is common to various fields of scientific 
data analysis: computational biology, computer science, combinatorial chemistry, neuroscience, and text processing 
are just a few examples (see P, Q for a general introduction on the subject). Its importance becomes particularly 
evident in the analysis of biological high-throughput experiments. To give an example, the number of gene probes 
analyzed simultaneously ranges from the order of tens of thousands in gene expression experiments (e.g. ~ 30, 000 for 
human DNA chips) to hundreds of thousands in the case of single-nucleotidc polymorphisms (~ 500, 000 for standard 
1^1 ' genotyping platforms). The information about certain phenotypical traits is, however, expected to be contained in an 
a priori unknown, but small fraction (e.g. < 100) of all measured probes. These probes may act in a combinatorial 
way, making their one-by-one extraction impossible. As a further complication, also the number of independent 
measurements rarely exceeds the order of few hundreds. Therefore the problem of extracting information from few 
high-dimensional data points has become a major challenge in biological research. Both the extraction of features 
1 being related to the phentypical traits (i.e. topological information) and the construction of an explicit functional 
£^ . relation between the measured values of these features and the phenotype are of enormous interest. 

The literature about feature selection has so far been concentrated around two main strategies: (i) wrapper which 
utilizes learning to score signatures according to their predictive value, (ii) filters that fix the signature as a pre- 
processing step independent from the classification strategy used in the second step. In this work we will present a 
replica computation on a wrapper strategy which falls into the subclass of embedded methods where variable selection 
is performed in the training process of the classifier. More concretely, we will present an analytical teacher-student 
computation on the properties of a continuous diluted perceptron (i.e. a perceptron where a finite fraction of the 
coupling parameters are zero). Dilution will be introduced via an external field forcing the student to set as many 



variables as possible to zero. This external field will be coupled to the L p norm || J\\ p = J2i \ Ji\ p of the coupling vector 
of the student perceptron. For p < 1, the cusp-like singularity of this function in zero actually sets a fraction of all 
model parameters exactly to zero, as required for diluted inference. 

This strategy is not new, but so far, most of the more mathematically-minded studies in the context of linear 
regression and various non- linear models [!, 0, @> S Q have been concentrating (a) on the case p = 1 , which is 
the only case of a convex L p norm with a cusp in zero, and therefore dilution can be achieved within the framework 
of convex optimization (this case is well-known under the name LASSO [3J]); and (b) on the case of a large amount 
of available data (our model parameter a would scale like hiN instead of being constant as in our setting), where 
mathematically rigorous performance guarantees can be given. 

It is, however, obvious, that the most efficient dilution should be obtained for p — 0, where non-zero parameters 
are penalized independently of their non-zero value. The non-convexity of the Lq norm introduces computational 
complexity. Very few studies have been published so far for a binary sparse classifier: after a work of Malzahn 0], 
where the theoretical performance of a continuous and a ternary (i.e. ±1, 0) perceptron are compared, the problem of 
the inference of a classifier with discrete weights has been analyzed in [nj, [ill, [Q , where both a theoretical computation 
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for the average case together with a message passing alg orithm has been proposed. Another way of attacking the 
problem has been recently proposed by Kabashima in 13 , IbH [l5j , where a continuous perceptron whose variables are 
masked by boolean variables mimicking dilution. 

The article is organized as follows. In Sec. |H]we the describe the generalization problem, and the replica approach 
used for its analytical description. In Sees. IIIII and HVl we apply the general results of the replica trick to non-diluted 
generalization and L p diluted generalization respectively. The performance of the non-diluted, Li and Lo diluted 
generalizations are compared in Sec. [V] In Sec. I VII the memorization problem is treated as a noise-dominated 
limiting case of generalization, and at the end, the main results are reviewed and put in context in the conclusions 
IVIII Three appendices are added to clarify some technical aspects of the mathematical derivations. 



II. GENERALIZATION AND REPLICAS 



Two common problems in Machine Learning are the so-called Memorization and Generalization problems. In either 
of them, a number of patterns {x^, /j£ (1 . . . M)} are classified by labels y^, and one aims at memorizing or inferring 
a rule that reproduces the given classification. We will study these problems, for the perceptron with continuous 
weights. 

Let us consider the case of N binary variables Xi = ±1 defining each pattern x M . We assume the existence of a 
hidden relation among these variables and the labels y^ = ±1 of each pattern: 

y" = a°(x») . 

The function a°(x) could be, e.g., the one relating the activated/repressed expression states of genes Xi with the 
presence or absence y = ±1 of a disease, or with the expression of another gene not contained in x. Unfortunately, 
a°(x) is unknown and all we have in general is a set of M experiments f),/ie (1... M)}, linking patterns x^ 
to labels y^. In supervised learning these experimental data are used as a "training set" to infer the real relations 
among the variables. As a first approximation, one could mimic the output function a°(x) as the sign of a linear 
combination, 

cr(J,^) = Sign(J-£") , 

where the N weights Jj, also called couplings, are parameters to be tuned in order to reproduce the experimental 
(training) data. Such a function is called a perceptron. Here the weights Js are allowed to take continuous real values. 

The memorization [Tsl . [13] and generalization |l8l . IT9I ] problems concern the question of inferring the optimal values 
of the Js from the training data {(y^, x^)}. To this scope we define the training energy (cost function) 

M 

E(j) = '£G(-y»J-x») (1) 
/< 

counting the number of misclassified patterns when J is used to reproduce the training data. The function 0(-) is 
the Heaviside step function: <d(x) = 1 if x > 0, and zero otherwise. Note that the function E(J) depends only on the 
orientation of the vector and not on its length, i.e E(J) = E(cJ) for all c 7^ 0. 

In general, the real unknown output function a°(x) will be a complex one, and attempts of reproducing it by a 
linear perceptron may fail. This means that the training energy will eventually become non zero if the number of 
training patterns is sufficiently large. However, we will focus on the case of realizable rules, this is, when the output 
function a°(x) is actually a perceptron, and there is always at least one set of weights with zero energy. 

The possibility of non-realizability will be accounted for as a random noise affecting the output. In mathematical 
terms, the training patterns are generated by 

y" = <r°(x) = Sign(J° • x» + if) (2) 

where the noise 77^ are i.i.d. Gaussian variables, with variance 7 2 , and the hidden perceptron parameters J° are the 
rule we are interested to "discover" . We will refer to J° as the teacher, and to the free parameters of our problem J 
as the student, since the latter pretends to reproduce the patterns generated by the former. Note that the training 
energy does not change when J is multiplied by a global scalar factor. To cope with this invariance, we will look 
for student vectors subject to the spherical constraint J ■ J — N . 

In the zero noise limit (7 — * 0), there will be at least one student capable of correctly classifying any amount of 
training data, namely J = J°. Upon increasing the noise level (7 > 0), the correlation between the patterns and the 
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teacher becomes shadowed by the noise, and the student will need a larger amount of patterns to learn the teacher. If 
the noise dominates completely 7 — > 00, there is no information left in the training data about the teacher's structure, 
and the student will memorize all patterns up to a critical threshold above which starts to fail.. 

In the case of a feasible rule, the number of perfect solutions for the student (E(J) — 0) is generally large. The 
entropy of the space of perfect solutions is a decreasing function of the number M of training patterns, since every 
new pattern imposes a constraint to the student. We can further restrict this space by looking at diluted solutions 
inside the space of perfect students. A general dilution term can be added to the training energy to form the following 
Hamiltonian 

0H(J) =(3E(J) + h\\J\\ p (3) 

where ||J|| P = X^i \Ji\ p i s the L p norm of the student. The dilution field h will be used to force dilution, and non- 
diluted generalization correspond to h = 0. Among the different choices of p, the case p = 1 corresponds to the 
Li norm \\J\\i = |Jj| used in the celebrated Tibshirani's paper 0, while p = corresponds to the L norm 

|| J\\o — — where Sj is the Kronecker delta. A particular feature of the L p -norm is that, for p < 1, it sets a 

finite fraction of the model parameters exactly to zero, whereas it is convex for p > 1. The only parameter common 
to these two ranges is p — 1, explaining the popularity of the Li-norm for convex optimization approaches. 

In the following we apply the replica trick to compute the volume of the space of solutions [lflllol, as well as other 
relevant quantities (order parameters) for the generalization problem. 



A. Replica calculation 

Let us consider the space of optimal solutions for the supervised learning of a realizable rule. The standard situation 
would be that a training set {(1/^,2^)} of M experiments is presented to be classified by a linear perceptron with TV 
continuous weights Ji. The number of patterns relative to the amount of variables, a = M/N, will play an essential 
role as a control parameter. We define the Gibbs measure for the student vector J as 

p (f\- 1 -l3E(J)-h\\.T\\ v 

It depends on the inverse temperature f3, and the dilution field h. In the (3 — > 00 limit, the partition function 

N 



Z(j3,h) = Jf[dJi exp(-0E(f)-h\\j\\ 



contains only terms of minimal training energy. So, by computing Z we can obtain the properties of the desired space. 
Although not explicit indicated, the integration should is over the sphere J • J — N to remove the scale invariance in 
the energy term in Eq. ([3]). 

In the partition function above, the degrees of freedom are the N couplings Ji, while the X th and y^ present in the 
Hamiltonian is the so called quenched disorder. As we care about the properties of the solutions in the typical case, 
we will have to average over these quenched variables. In particular, the x^ will be i.i.d. random variables in {±1}^, 
while the labels are generated from the hidden structure of the couplings by equation The teacher weights J° 
too are i.i.d. i random variables distributed as: 

p(J a ) = (l-nl t )S J o+nl ! p'(J°) . (4) 

The first term introduces the sparsity of the teacher, and the second term contains all non zero couplings. Later we 
will use the letter t to refer to the variance of this distribution. The effective fraction of couplings rfi it — N if° , is the 
relative amount of non-zero couplings, and sparse models correspond to small nP t{ <SC 1. The fact that J° is involved 
directly in the computation will allow us to compare the student vector J to it. 

The free energy / = — 4 log Z is the relevant thermodynamic quantity, and the one that should be averaged over 

the quenched disorder. However, the direct integration over x M and J° in log Z is out of reach. To work around this 
obstacle, we use the replica trick i21], which consist of using the known property 

Z n — 1 

log Z = lim (5) 

n^O n 
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to average over Z n , instead of logZ, and sending n to zero afterwards. Note that Z n is the partition function of a 
n-times replicated system, if n is integer, which is the origin of the method's name. In our case the averaged and 
replicated partition function would be 



M N N n N 

& = *- mn e / / n^j?) / nn^ 



exp -/?EE 

^ a— 1 /^— 1 



i=l 
N 



a—1 i—1 

N 



i=l 



E J " a 



^En /a i 



(6) 



where D 7 7y Al stands for the Gaussian distributions of the noise variable ry^, with variance 7 2 



_ i» 

e 

D 7 77 M = — 7==d?7 AJ 



7V 27T 



This notation will be used throughout the paper, and if the subindex 7 is omitted, it refers to 7 = 1. 

After some standard steps detailed in appendix [XJ the replica symmetric estimate of the free energy is obtained as 

- (3f = extr^^^A j-rf + ^qq - X + Gj + aG x 

The order parameters q, q, r, r and A were introduced via Dirac-delta functions in the replica calculation. In particular 
q = N^ 1 < J a ■ jk > is the overlap between two (independent) students solutions. The notation < • > stands for the 
expectation value w.r.t the Gibbs measure. Note that < q < 1, it will be 1 when the Gibbs measure is condensed in 
a single J, and it will be smaller than one when the measure is more spread. The parameter r = N^ 1 < J ■ J° > is 
the overlap between the student vectors and the teacher, and will be crucial in our understanding of the performance 
of generalization. The parameters q, f, and A are the corresponding associated Fourier variables (to represent the 
deltas introduced in the replica calculation). The last one, A, corresponds to the spherical constraint J ■ J = N. 
The terms Gj and Gx are given by 



Gj 



Bx / dJ°p{J°)log / dJexp 



h\\J\\ p + (fJ° - V$x)J 



(J) 



G 



x 



BxH 



y/ 'qi 1 + qt - r 2 ) 



log (e-' 3 - 1)H{- 
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with H{x) = -^= e y2 / 2 . From the replica calculation, the term Gj can be interpreted as the effective free energy 
of a single J. The inner term Zj(J°,x) = J dJexp ^— (| — A) J 2 — h\\ J\\ p + (fj° — ^/qx)J^j plays the role of a single 

J partition function, while the term logZ,/ corresponds to its free energy. The dependence of Zj(J°,x) on J° and x 
is conditioning the free energy of the single J to the different values J° of the corresponding element in the teacher 
vector, and to the effective "noise" from the realization of the training patterns x^ . So the integration over J° and x 
gives the average effective free energy of a single J. 

This interpretation of Gj allows also for formulating the following joint probability distribution of x, J° and J 



p(x 1 j°,j) = e - 7 =Lp(j ) 



-{i~x)j 2 ~h\\j\\ p +(fj°~Vqx)J 

/27T ' ' 'J &J e -ii->.)J 2 -h\\J\\ p + (rJ«-Jqx)J 

such that any expectation value of a generic function g[x, J° , J) can be found as 



(8) 



E[g(x,J°,J)\ = J dx J d.J° J dJ g(x,J°,J)P{x,J°,J) 



The limit (3 — > 00 is trivial in Eq. (0. It concentrates the Gibbs measure onto the subspace of students with 
minimum training energy (error), and in the case of a feasible rule to the perfect solutions E{J) — 0. The actual 
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values of the variational parameters are determined by the saddle-point condition for the free energy V g . r ..../ = 0. 
With all the previous definitions, at zero temperature (/3 — * oo) this condition is given by 

f ( xry/1 - q \ x 

Dx H 



q y^^/il - q)q J \ V ' qi 2 + qt - r 2 J H(y/qx) 

-2a 

Dx x iogH 




2tx\J <77 2 + qt — r 2 

q = l + -=E[xJ] (9) 

vq 

r = E[J°J] 
1 = E[J 2 ] 

This set of equations has to be solved numerically for each a = M/N and each dilution field h. The resulting values 
of the variational parameters q, r, q, f and A are used to describe the solution space. For instance the generalization 
error, i.e. the probability that a new pattern (independently generated from those used for training) is misclassified 
by the student, depends only on the overlap between teacher and student r (see [l9j |) 

I r 

e = — arccos — (10) 

The square root of the variance of the teacher t = J dJ°p(J°) J° 2 is required because the teacher is not necessarily 
normalized to unity. 

The solution of the fixed point equations can also be used to construct the Precision vs Recall curve, which is a 
standard check for a classifier. In the case of model selection we can use the information given by the student solution 
J to classify the couplings as relevant Ji > J t h or not relevant Ji < Jth, where Jth is a sensibility parameter. This 
means that we will disregard all inferred Ji which are not strong enough. With the joint probability distribution ([5]) 
we can compute the probability of having any of the following situations 

True Positive TP J l ^ J° ^ 

False Positive FP J t ^ J° = 

True Negative TN J t = J° = 

False Negative FN J 4 = J° f 

For instance the probability of having a true positive (TP) is Ftp = E[9(| J| — Jth)(l — Sjo)]. 
The recall (sensitivity) and the precision (specifity) are defined as follows 



RC= PtP =^ PR = „ PTP n (11) 
Ptp + Pfn n° eS Ptp + Pfp n% 



where nP tt is the real sparsity of the teacher (see (Q|) and = E[| J| > J t h] is the dilution of the student when the 
threshold value for a relevant coupling is Jth- Note that both the recall and the precision depend on J t h, as well as on 
the variational parameters g, r, q, f and A that solve the fixed point equations (J9j> . The PR-RC curve is the parametric 
curve RC(Jth) vs PR(Jth)- the closer we can get to RC = 1 and PR = 1, the better the student perceptron has 
understood the topological structure of the teacher. 



III. NON-DILUTED GENERALIZATION 



To avoid confusion we will call sparse the case of teachers with many trivial couplings = 0, while the term 
diluted will be saved for the generalization method (non-dilutcd/dilutcd). The replica calculation hitherto developed 
is general in a set of aspects. First, the teacher distribution |4]) can be of any kind, including a non sparse teacher 
nP a — 1, although we will focus on the case of sparse models. Second, the possibility of a non-diluted generalization 
can be accounted by setting the dilution field h — 0, and for h ^ different choices of regularization are possible. 
In this paper we will show the results for L and Li . For each of these cases (non-diluted, L and Li ) the replica 
calculation has it's particularities, which we present hereafter. 
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The simplest case is the non-diluted generalization (h = 0), as some of the equations simplify considerably, being 
equivalent to those in [r|. The absence of the dilution term in ([7|) makes the expression integrable, such that 

f 2 t 4- n 1 

The first two fixed point equations in @ do not change, while the last three can be reduced to two algebraic equations 
without A, 

q = {f 2 t + q){l-qf (12) 
r = ft{\ — q) 

The value of A can be recovered using (1 — q)(q — 2 A) = 1. The fixed point equations can be solved numerically for 
evaluating the generalization error (fTU|) as well as the PR-RC curve. The calculation of the expectation values using 
© is also simplified since 

V ' y/2n(2q-2X) 2(2g - 2A) V ; 

and the terms involved in recall and precision are easier to obtain. 




1 1 1 1 1 1 1 1 1 1 1 

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 

RC 



FIG. 1: The Precision-Recall curve for the non-diluted generalization for different values of a. For growing values of the amount 
of training data a, the curves approach the PR = 1 line, meaning that the student solution is doing an almost perfect model 
selection, for a certain value of the sensitivity threshold J t h- 

Let us take as a toy example the case of a sparse teacher with only n° ( = 5% non-zero couplings. We set the noise to 
7 = 0, such that there is always a perfect student solution. In particular, we will use a discrete teacher J° e { — 1, 0, 1} 

p(J°) = (1 - 0.05)^a + 2|5 (Sjo^ + Sjo A ) (14) 

With such a simple structure it happens to be the case that teacher's dilution and variance are both equal nP tt = t = 
0.05. 

The solution of the fixed point equations (([9|) and (fT2|) ) is found numerically for different values of the amount of 
training data a. For each a, the different PR-RC curves are shown in Figure [T] It is clear from the figure that for 
sufficiently large a, for instance a > 2.0 in this example, the generalization is capable of a good classification of the 
couplings J, achieving both high precision and high recall, i.e. a good performance in model selection. This is seen 
in the figure as a curve that approaches the PR = 1 line. 

It is no surprise that more training data results in better model selection. However, we can gain some information 
about how the solution approaches perfect model selection by looking at the statistical distributions of the student's 
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FIG. 2: The statistical distribution of the students couplings J for three different values of a. When enough training data is 
given (a — 2.5 in this figure), the distribution P(J) can be recognized as the superposition of Gaussian distributions located 
around the discrete (and rescaled) values of the teacher's couplings. In such a case, setting to all those student's couplings 
Ji that are around results in a nearly perfect model reconstruction. As less information is used for training, the Gaussians 
overlap, and any threshold for the relevance of a coupling J t h will misclassify some couplings, resulting in a worse model 
selection. 



Js. Figure [5] shows how P(J) — J dJ°P(J°, J) (Eq. (fT5]0 concentrates around the discrete (and rescaled) values 
of J° with a set of Gaussians that have neglectable overlap for large values of a. Above a critical a ~ 1.6 (in this 
example) we can start to discriminate the Js from different Gaussians because local minima in P(J) emerge. It is 
expected that above this point a reasonable value for Jth is the one satisfying 

dP(J) d 2 P{J) 
dJ dJ 2 

This choice for J t h leads to the recall and precision, which can be seen in Figure [T] marked by the square symbols. 
For a = 2.5 the Gaussians in P(J) are almost perfectly distinguishable. The optimal choice for J t h bas a precision 
and a recall near one (RC = 0.991, PR = 0.996), meaning that generalization is achieving an almost perfect model 
selection. 



IV. DILUTED GENERALIZATION 



At zero temperature (infinite 0) and h = 0, the Gibbs measure gives the same probability to all perfect student 
solutions, since they have the same energy E(J) = 0, while suppressing completely positive-cost students. Working 
at zero temperature, a dilution field h > gives the chance to impose a different measure over the set of perfect 
solutions. This measure favors the students with the lowest values of || J\\ p , and in the limit of h — > oo, it concentrates 
in the perfect solution with the highest L p dilution. We will now study the properties of the subset of perfect students 
with the smallest L p norm. 

Unlike the trivial /3 — > oo limit (see Eq. ([7])), the large- dilution limit has to be taken carefully as some parameters 
diverge. For large dilution field h — > oo we have q — ► 1, meaning that different students are very close to each other, 
and in the limit h = oo there is only one student which is at the same time zero-cost (E(J) = 0) and maximally 
diluted. As can be seen from the fixed point equations ([5]), when q tends to 1, the order parameters q, f and A diverge. 
The scaling behavior of these variables is the following 

(l-q) ~ f f ~ Rh 

q ~ Qh 2 f -A ~ fh { 0) 
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In terms of these new variables, the fixed point equations for q and r © become 



Q 



R 



a 



7TQ 2 
a y/j 2 + t 



ArcCot 



r r\J 7 2 + t - 

V7 2 + t - r 2 l 2 +t 



Ml 2 + t) 



(16) 



and do not depend on the dilution p. Furthermore, this scaling makes the exponent of the exponential term inside 
P[x, J°, J] (Eq. ©) proportional to h. So, in the remaining three equations 



-.E[xJ] 



r = E[J°J] 
1 = E[J 2 ] 



(17) 



The specific details of 



the expectation values for h — > 00 are dominated by the largest values of the exponent in 
this saddle-point calculation depend on the actual dilution || J\\ p . 

Among all possible values of p, the cases p — 1 and p — are special for both their meaning and their simplicity in 
the calculations. The Li norm is extremely popular in machine learning because it maintains the convexity of a convex 
cost function while forcing sparse solutions [3|. On the other hand, Lo lacks completely of the convexity preserving 
property (it is not even continuous), and therefore is not a suitable penalization for convex optimization. However, 
the Lo norm is optimal in the sense that it does not deform the Hamiltonian beyond penalizing non-zero couplings. 
We will compare the dilution achieved by the Li approach with the largest possible dilution (the one obtained using 
Lo ), and give a qualitative description of this widely used regularization. Another simple and common choice for the 
penalization is p — 2, but in our model setting it is meaningless since the student is constrained to the sphere and 
therefore has a fixed || JH2 = N. 



Li dilution 



We first discuss the case of Li -regularization ||J||i = ^2 i=1 \ Ji\- For h — > 00 the expectation value of an arbitrary 
function g(x, J° , J) is given by 



ng(x,J°,J)} = Jvx J dj° P (j°) 



5 (a;,J ,0) 



\RJ° 



iQx\ < 1 



j0j RJ°-V$ X - Si MRJ°-V$ X ) ) Qtherwisa 



The derivation of this expectation value is shown in appendix |SJ As already mentioned, one of the virtues of the Li 
-regularization is that it forces the solution to be diluted by setting a fraction of the couplings exactly to zero. This 
fact becomes evident in the previous equation. 

Using the scaling behavior (| 1 5 [) . and calling S(J°,x) = Sign(i?J° — yQx), and defining the functional 



L[] = Jbx [-M\RJ° - \[^x\ - 1) , 



the resulting fixed-point equations in the h — > 00 limit are (|16j) and 

1 



(18) 



1 



- / dJ° P (J^ 



RJ° 2 L[1] - \fOj°L[x] - J L[S(J a ,; 



K = (Q + l)Q + rR+— I dJ°p(J°) 



\/C>L[xS{J Q ,x)} - RJ°L[S{J°, 



Note that the original parameter q is no longer present, since it is 1, but the overlap between teacher and student, 
r, is still a non trivial order parameter. 
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B. Lo dilution 



For the Lo -regularization the dilution term in the Hamiltonian ([3]) is ||J||o = Ylj (1 — <5./ i ) , punishing only the 
fact that a given Jj is non zero, but otherwise making no distinction between different non-zero J-values. In a 
strict mathematical sense, introducing the L norm in the Hamiltonian is meaningless, since a finite single-point 
discontinuity cannot alter the integration over the continuous range of J-values in the partition function. So the Lo 
norm can only be understood as the limiting case p — ► +0 of a family of continuous functions (see appendix ITT)) . 

Using a similar approach as the one presented in appendix |5] for Li , the expectation value of an arbitrary function 
g(x, J , J) in the h — > oo limit reads 



E{g(x,J°,J)] = J Dx JdJ°p(J°) 



g(x,J°,0) 
g(x,.r , 



otherwise 



(19) 



The fixed-point equations for Q and R are exactly the same as for Li -dilution (eq. (TIT))) ), while the other three 
order parameters are now given by 



Q = ^ / d./'Vi./":, 



, = - / djV(j°) 



M[l] - -^=M[xS{J°,x)] 



RJ° M[l] - \jQJ°M[x] 



(20) 



where M[] = fDx [■] 



K = rR + QQ 
■ \kJ°-^x\ _ ^ and m[xS(J°,x)] is defined 



M[xS(J , x)\ — exp ^ cosh ■ 



2tt 



2Q 



C. Dilution, recall and precision 



The numerical solution of the fixed-point equations for the Li and Lo dilutions gives us the overlap r between the 
teacher and the student. The generalization error is obtained using Eq. (|10|) . 

The most striking effect of the norms is the emergence of an extensive number of couplings that are exactly zero 
(see P(J) in appendices). The fraction of non-zero couplings is the effective dilution n ott achieved by the student, and 
it is obtained as 

^-*w>*-{i%M-' )K iizz <-> 

It is expected (and numerically observed) that for large values of a the effective dilution n cff converges to the real 
dilution of the teacher n° f . 

Along the same lines developed for the non-diluted case, we can further restrict the set of non-zero couplings by 
setting a threshold for relevant couplings. In other words, we interpret as non-relevant all those couplings that are 
not strong enough, Ji < | J t h \ ■ In this case, the fraction of relevant couplings equals 

n% = E[Q(\J\ - J th )} 

and can be used to calculate the precision according to Eq. dTTJ) . The other terms appearing in recall and precision 
are also computed using the expectation value E[-] for each dilution scheme. For instance, the probability of having 
a false positive is given by Ppp — E[6(| J| — Jth)S.j°}- 



V. HOW WELL DOES DILUTION WORK? 



The discussed mathematical machinery can shed some light on this question. To see the differences between diluted 
and non-diluted generalization, and its performance in sparse model selection, let us use the same toy example used 
for the non-diluted case, with a teacher of dilution n° ff = 5% and discrete values Jf £ { — 1, 0, 1}, see Eq. HI]). 
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FIG. 3: The dilution n d t(a) achieved by the Li and Lo dilutions, as a function of the amount of training patterns a. The Lo 
regularization approaches the dilution of the teacher n% { = 0.05 from below, including non zero couplings only when strictly 
required to correctly classify the training data. The Li dilution is not that efficient, and for a > 0.45 it uses more non-zero 
couplings than actually needed. 

The functions n etl (a) for the Li and Lo dilutions are presented in Fig. [3l It can be seen that Lo -diluted generalization 
goes monotonously from below to the correct value n cff = 0.05, in a somehow Ocams-optimal way. In other words, 
Lo dilution adds non-zero couplings just when strictly required by the empirical (training) evidence. The Li norm 
isn't that effective. It is an interesting result that, for a certain range in a, the Li optimal solution requires more 
non-zero couplings (n eff > n® t{ — 0.05) than actually present in the teacher. This overshooting is the cost we pay for 
deforming the Hamiltonian by the Li -penalization of large couplings. Unlike Lo regularization, Li approaches the 
correct dilution from above, not from below. 

One could be tempted to call the change of slope of Lo near a = 0.8 in Fig. [3] a transition to perfect student 
solution, but it is not. The generalization error in Fig. [4] shows that errors persist also for larger a. On the other 
hand, while the Li norm goes smoothly to e = 0, the Lo undergoes an abrupt reduction of the generalization error 
near a = 0.8. This might be a sign of a transition to (almost) perfect model selection, such that for a > 0.8 the 
student has identified the correct Ji — 0, and its mistakes are restricted to the actual values of those J^s that are 
non-zero. 

To compare the Li and Lo dilutions to non-diluted generalization, we show the PR-RC curves in Fig. El for four 
typical values of a. The curves for the diluted generalization seem to miss the right part - but they are not. As the 
Li and Lo methods set a fraction of the couplings exactly to zero, lowering the threshold J t h will never achieve to 
include them as non-zero couplings, and this is why we can not arrive at recall equal to one. 

Looking to the PR-RC curves, the first obvious fact is that the non-diluted generalization is much worse than any of 
the diluted ones. The next interesting fact is that Li performs slightly better than Lo for low values of a, something 
that could be seen also from the generalization error (Fig. [4j. Finally the sudden change to PR = 1, for a — 0.8, of 
the PR-RC curve for the Lo dilution is saying that Lo fastly moves to almost perfect model selection, as we guessed 
from the generalization error curve. However, there is no critical a, and the sudden change is not a phase transition. 
This can be seen more clearly working with less diluted teachers (for instance nP tt = 0.1, data not shown). To gain 
some more understanding of the onset of an almost perfect model selection it is interesting to see the distribution of 
couplings, P(J), which are shown in appendices [Bl and [U1 

VI. THE MEMORIZATION LIMIT 

In the calculations presented so far, the noise jf affecting the output y M in Eq. @, was neglected by setting its 
variance to 7 2 = 0. By doing so, we guaranteed that for any a, there is always at least one zero-cost solution for the 
student, namely J = J°. Let us now study the opposite extreme case where the noise is extremely large. In that case, 
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FIG. 4: The generalization error at different a. The two curves at the center are the generalization error achieved using Li and 
Lo dilution. While the Li error decreases smoothly with the training data, the one corresponding to Lo undergoes a sudden drop 
near a — 0.8. This is a consequence of a sudden move to almost perfect model selection, where the set of non-zero interactions 
has been identified with very good precision. Both, the superior and lower curves, correspond to the non-diluted generalization, 
where the lower one has been plotted with rescaled z-axis, e(a/n° f ). If the student could know from the beginning which are 
the non-zero couplings, it could use all the training data to tune the values of these couplings, resulting in a huge reduction of 
the generalization error. 



the output function is given by 



" - a°(x) = Sign« 



i.e. the patterns are randomly classified by — ±1. The teacher's couplings J 4 ° become irrelevant, and the student 
will try to learn (generalize) a non-existing hidden relation. This limit is equivalent to the well-studied memorization 



problem of random input-output relation. It is a classical result [17[ that for a > 2 the student will fail to correctly 
classify all patterns, while for a < 2 the student can find a solution of zero energy. In the latter case, the student vector 
J reproduces correctly the relation between the M patterns x^ and the corresponding labels j/ M by = Sign(J • x^). 
The student was capable of memorizing the labeling of the input patterns. 

Therefore memorization can be studied as the noise-dominated limit of generalization. By computing the limit 
7 — > oo in the fixed point equations for q and f ©, we found that f — while 

, Bx TT * . (22) 

^2^(1 -q)J H(^qx) 

The expectation value of a function g(x, J) becomes 

mg{x, J)] = / Bx J aj ^ x > J > e (23) 

In particular we have r = E[J° J] = 0, meaning that the overlap between student and teacher is zero, which is an 
obvious consequence of the large-noise limit. So, the set of variational parameters describing our problem reduces to 
q, q and A, and the fixed-point equations are (|22|) and 



q = 1 + -^EtaJ] 1 = E[J 2 ] . 

The generalization error and the precision- vs. -recall curve are meaningless in this context. However, we can still 
check the efficiency of the hi and L memorizations in using as few as possible non-zero couplings to memorize a set 
of patterns. There is a first trivial conclusion, coming from the already stated fact that a continuous perceptron is 
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FIG. 5: The PR-RC curves of the non-diluted, Li diluted, and Lo diluted generalization for four different values of a. Both, 
the Li and Lo dilutions, outperform the non-diluted generalization. Particularly, the Lo dilution moves suddenly to almost 
perfect model selection near a — 0.8. 



capable of memorizing without error until a = 2. This means that a perceptron with N couplings Ji can remember 
the classification of M = 2N patterns. It follows directly from this that if a < 2 patterns are given, we can set to 
zero any fraction 1 — a/2 of the couplings, and still be capable of memorizing without error with the remaining a/2 
couplings. We are interested in how much more dilution can be obtained by the introduction of a dilution term in 
the Hamiltonian. Note that if, instead of setting to zero a random group of (1 — a/2)N couplings, we optimize their 
selection, we can go far below the trivial n off — a/2 dilution. 

We can solve the fixed-point equations in the limit of large dilution fields h — > oo. Once again the solution space 
reduces to only one solution q — > 1, so there are some divergences in the equations. The scaling behavior of the 
variational parameters is the following 



(1-9) 

q 

f-A 



9. 

h 

Qh 2 



Using this scaling, the expectation values are given by 



g(x,0) 

E[g(x,J)] = I Dx i /a a . f , 

I 1 ^ x VQ£-Sign(s) N 



x 2 < 4 

Q 

otherwise 



for the Li case, and 



for the L case. 



E[g(x,J)} 



g{x,0) 



x 2 < 2£ 
Q 

otherwise 
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FIG. 6: The maximum dilution achieved when using the Li and Lo regularizations in the Memorization problem. The trivial 
dilution 7Wf = a/2 is outperformed by either Lo and Li dilutions, and Lo is the more efficient of all. For a > 2 the student 
fails to memorize all patterns and there is no student solution of zero energy. 

Solving numerically the corresponding fixed point equations, we can compute the dilution achieved by each method 

"-off = 1 - D [ S J,o] 



21! (Q- 1 / 2 ) Li Norm 

2H (VTKQ- 1 ' 2 ) L Norm 



The resulting functions n etf (a) are show in Figure [Sj It shows that both Li and Lo achieve a much stronger dilution 
than the trivial random one. As in the generalization case, the Li regularization works worse than Lo , the reason 
being that it penalizes large J values. The dilution achieved in memorization for the Li and Lo dilutions is always 
above the corresponding generalization curves in Fig. [3] Although not shown in either figures, we checked that near 
a = the corresponding curves coincide, as there is no difference between learning and memorizing when too few 
training data are given. 



VII. CONCLUSIONS 



In this paper, we have presented an analytical replica computation on the generalization properties of a sparse 
continuous perceptron. Dilution has been achieved in different ways: First, it can be imposed naively by using non- 
diluted inference, followed by deleting all those couplings which are below some threshold value. Second, it can be 
achieved by introducing a dilution field which is coupled to the L p -norm of the coupling vector, penalizing thereby 
vectors of high norm. For p < 1, the cusp-like singularity of the Lp-norm in zero forces a finite fraction of all couplings 
to be exactly zero. We have studied in particular two special cases: (i) p = 1 is a popular choice in convex optimization 
since it is the only value of p which corresponds both to a convex penalty function and dilution, (ii) p = achieves 
optimal dilution since it penalizes equally all non-zero couplings independently on their actual value, but due to the 
non-convex character of this penalty, it easily leads to computational intractability. 

As a first finding, we see that both L p schemes work fundamentally better than the naive scheme, both in the 
questions of model selection (i.e. for the identification of topological properties of the data-generating perceptron 
given by its non-zero couplings) and in the generalization ability. For a very small or a very large amount of training 
data, Lq and L\ achieve very comparable results. We find, however, an intermediate regime where Lq suddenly 
improves its performance toward almost perfectly model selection, whereas L\ dilution shows a more gradual increase 
in performance. This is very interesting since this regime is found for relatively small data sets, and in many current 
inference tasks (e.g. in computational biology) the quantity of data is the major limiting factor for the computational 
extraction of information. It mi ght be in this parameter region, where statistical-physics based algorithms like the 
ones presented in [l(| EH E2, ESUIL EH may outperform methods based on convex optimization proposed in 0] . 



14 



These analytic results call for efficient algorithms in real case studies. At odds with the linear-regression case with 
Iji norm, in the case of a continuous perceptron, a simple gradient descent strategy does not work due to the presence 
of a zero- mode in the energetic term Eq. ([T]) (E(J) = E(cJ) for every scalar c > 0). The zero-mode has been removed 
in the computation by fixing the modulus of the classification vector (J ■ J = N). Unfortunately this spherical 
constraint breaks the convexity of the problem and it is not clear if there are more ingenious ways for removing the 
zero-mode that could work, at least in the Li norm case. Another possibility that we are planning to follow is that of 
considering variational approximation schemes like belief propagation for continuous perceptrons [ill . Il2l . [l5| , which 
are able to overcome also the problem of the non-convexity of the Lq norm. 

During the preparation of this manuscript, a related study on the efficiency of L p dilution in systems of linear 
equations was posted online [22| . Also there, the relative importance of Lq and L\ dilution was studied, with 
conclusions which are highly compatible to ours. 
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APPENDIX A: REPLICA CALCULATION DETAILS 



The calculation of Z n in Eq. Q is done by the introduction of an overlap matrix Q a ,b using constraints 

S(NQ a , b -J2^J") 



for any < a < b < n. As there is a symmetry in the replica indices Q a ,b — Qb,a, only the half of the matrix is needed. 
The value a = refers to the teacher, while a = 1 . . . n to the n-fold replicated student. Among these constraints, 
there are some that are particular. For instance the term Qo,o is the variance of the teacher, and it should be equal 
to the variance t of the teachers distribution Similarly, the n terms Q a , a are set to 1, in order to impose the 
spherical constraint on the student, since the energy ([T]) is invariant to elongations of the student vector. 
Using Fourier representation of the Dirac-deltas, the replicated partition function is 

dQ a , b dQ a ,bd n \ a 



— exp iN E Qa.bQa.b + iN^X a + iMXo 



D 7 »7 



j f d n j a p(j o )e xp (-hJ2\\J a \\ P ~*Y, ja Q^ Jb n 

Y a a<b J 

™ nx " «p(-/3 E 9 H x ° + v)* a ) + * E xa * a \ E x a Q a .a b ) 



(Al) 



where Q a ^ are the conjugated parameters in the Fourier representation of the deltas. In particular, A and A a are the 
one corresponding to the teacher variance and the spherical constraint. To save some space, we used the short-hand 
notation d n A a as a substitute for n™=o d^ a j and dQ a b for the differential of all the terms in the overlap matrix. 

The next step in the replica calculation is to assume a structure for the overlap matrix. In the replica-symmetric 
case, the overlap matrix and its Fourier counterpart have the structure (exemplified for n = 3) 



r 1 
r q 1 
\r q q 



) 



\ 



( A 

f X 

f q A 

f q q A J 



(A2) 



The Fourier mode corresponding to the variance of the teacher t, can be shown to be A = 0, while that of the 
spherical constraint remains a variational parameter A = —i\ a . The other parameters are q, the self overlap between 
two student solutions, r, the overlap between an student and the teacher, and their conjugate Fourier modes q and f. 

It is a standard feature of the replica trick to invert the order of the limits, doing N — > oo first, and then n — + 0, profit- 
ing thereby of the saddle-point method to solve the integral in (|A1[) . Note that the last two lines in (|A1|) can be brought 
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to the exponential by using X — explogX. Thus the value of the free energy — j3f = lim„^o linijv^oo -^logZ™ is 
given by extremizing the equation 

-f3j= -rf + -qq- X + Gj + a G x 
with respect to the variational parameters (q, r, q, f, A), where we have introduced 
G.i = Jdx /d/V(J°)log J dJ s -'i- nf '- h '"'*'""- A " 

and 



APPENDIX B: LIMIT h -> oo 



The scaling behavior of the parameters q, r, q, f and A in the limit h — > oo 

g ~ Qh 2 | -A ~ f/i 

were first obtained by looking at the solutions of the fixed-point equations for growing values of h, and their consistency 
was checked later in the fixed-point equations. Considering this scaling, the expectation value of a generic function 
g{x,J°,J) is given by 



V[g( X , J°, J)] = / D* dJMJ T gl ' ' — . or . (Bl) 



The diverging prefactor /i in the exponentials forces the main contribution to the J-integration to come from the 
largest value of the exponent (saddle-point approximation): 

J* = argmin ( y J 2 + || J|| p - (RJ° - yJQx)j) (B2) 



j 



In the case of the Li norm (|| J||i — \ J\) the solution of the previous equation is given by 

jH o |A7°-V5r|<i 

g^M^l otherwise 



The expectation value is thus 

E[ ff (x,J°,J)] = | Dx | dJ°p(</°) 



g(x,J°,0) \RJ°-\JQx\<l 
g(x, J°, otherwise 



The probability distribution of the students couplings P(J) can be obtained as P(J') = E[5( J — J')] resulting in 

K f (flJ°-SignJ-KJ) 2 

P(J) = (1 - n cff ) 6 (J) + — / dJ° j o( J°)e 53 

where ra cff = 1 — J Da; / d J°p( J°)<d[\RJ Q — \J~Qx\ — 1]. The continuous part of this distribution is shown in Fig. [7] 
for the same four values of a for which the Precision-Recall curves were shown in Fig. We can see that for growing 
values of a, the distribution P(J) is more concentrated around the discrete values of J, and the amount of couplings 
that are small but not zero, reduces continuously. This explains the high performance in model selection of the Li 
dilution for a > 1.0. 

Note that the calculation of the smallest value of the exponent in (|B1[) is particularly simple for Lo and Li . Other 
values of p may require a numerical solution. It is simple to see that in the p > 1 case no dilution is obtained. 
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FIG. 7: The distribution P(J) of the student couplings for the four values of a used in the Precision-Recall curves of Fig. [5] 



APPENDIX C: L AS THE p -> LIMIT 

The Lo dilution corresponds to a term 5Z i (l — in the Hamiltonian ([3]). However, the Kronecker delta is zero 
for all non-zero arguments, with an isolated and finite discontinuity in the origin. This single-point discontinuity is 
irrelevant in the integration over continuous Js in the partition function as well as in (|B1|) . Therefore using the Lo 
dilution from the beginning gives the same results as the non-diluted case h — 0. Nevertheless, we can still interpret 
the Lo norm as the p — ► +0 limit of the L p norm. 
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FIG. 8: The p = and p > cases of the function Y P (J) — J 2 + || J\\ p — (RJ° — y Qx).J in the two characteristic situations 
where J* = and J* > 0. The closer p to zero, the closer the function Y P (J) is to Jq(J). 

For general p > there is no explicit solution for Eq. (|B2|) . We will argue that the limit p — > of such solutions is 
exactly the solution of 

Jo* = argmin ^ J 2 + (1 - 5,/) - (RJ° - \[~Qx)J^ 

just as if we would have introduced the Lo norm from the beginning, and taken naively the saddle point including the 
isolated singularity. There are two candidate values for Jg , one is and the other one is the zero-derivative point of 

the quadratic function Jq = rp^ e i a ^ er w jri ^ ne ac ^ ua i solution if and only if 

2T < 
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If the opposite inequality is satisfied, the solution is Jq =0. Both situations are shown in Figure [H The function 

(RJ° — \/0) 2 

\J\ P tends to 1 as p — ► for all J ^ 0, so we have also J* — > Jq whenever 23 Y 7^ 1. The point where the 

equality holds corresponds to the neglectable case when the value in J = is exactly equal to that in the point of 
zero derivative. We conclude that except for this single point, J* — > Jq as p — > 0, and therefore we can replace the 
Lo norm directly into the steepest descend condition to obtain the p — > result. 

0.035 I 1 1 1 1 1 1 1 1 
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FIG. 9: The distribution P(J) of the student couplings for the four values of a used in the Precision-Recall curves of Fig. [5] 
Comparing also this result for Lo with the one for Li in Fig. [7] we can understand the difference in their performance. 

Repeating the steps shown in appendix [Bj a similar computation for the Lo dilution gives the expectation value 
reported in Eq. (TIT))) , and the following probability distribution for the student couplings 

p(j) = (1 - n rff ) 6(j) + * I dj°p(J )^ (flJ °^ J,2 e(|j| - Jh 

This distribution is shown in Fig. [5]for the same four values of a for which the Precision-Recall curves were shown in[S] 
Note that the main difference between this distribution and the corresponding to the Li dilution Fig. [7] is the presence 
of the O(-) function in the former. When the Gaussians of the continuous part of the distribution have a standard 
deviation smaller than the gap in the O function, the presence of False Positives corresponding to the Gaussian around 
J° = is suppressed by the Q function, and this is the reason why we observe such a good performance in model 
selection for a > 0.8 in Fig. [5] 
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